feat: support appstream data keywords and categories for FTS by m1rm · Pull Request #204 · archlinux-de/www.archlinux.de

m1rm · 2026-04-12T13:24:55Z

TL;DR

Integrates Arch Linux AppStream metadata into the Go app: adds keywords and categories columns on package, fills them from the upstream Components-x86_64.xml.gz files on sources.archlinux.org, and exposes them to FTS5 search with tuned BM25 weights. Adds an update-appstream CLI command plus just update-appstream, and includes it in just update-data.

Motivation

Improve package search (including German-oriented terms) using AppStream <keywords> and <categories> data.
Keep the implementation streaming and aligned with existing update jobs (pacmandb-style callback parsing).

Data source & versioning

Snapshot directory (e.g. 20260326): the pkgver reported by https://archlinux.org/packages/extra/any/archlinux-appstream-data/json/.
Per-repo XML: {APPSTREAM_SOURCES_BASE}{pkgver}/{core|extra|multilib}/Components-x86_64.xml.gz
default base: https://sources.archlinux.org/other/packages/archlinux-appstream-data/
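
As a rough illustration of the URL template above, the following sketch assembles the per-repo download URLs. The helper name componentsURL and the trailing-slash normalisation are assumptions made for this example and are not taken from the PR's code.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultBase is the default base URL quoted in the PR description.
const defaultBase = "https://sources.archlinux.org/other/packages/archlinux-appstream-data/"

// componentsURL is a hypothetical helper: it fills in the template
// {base}{pkgver}/{repo}/Components-x86_64.xml.gz.
// An empty base falls back to defaultBase (assumed behaviour).
func componentsURL(base, pkgver, repo string) string {
	if base == "" {
		base = defaultBase
	}
	// Normalise to exactly one trailing slash so the template joins cleanly.
	base = strings.TrimRight(base, "/") + "/"
	return base + pkgver + "/" + repo + "/Components-x86_64.xml.gz"
}

func main() {
	// The three repositories named in the PR description.
	for _, repo := range []string{"core", "extra", "multilib"} {
		fmt.Println(componentsURL("", "20260326", repo))
	}
}
```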

Behaviour

What is indexed

Keywords: text from <keywords>/<keyword> only (not name/summary/description).
Categories: text from <categories>/<category> only (not pacman groups).
Language (accepted):
- blocks without xml:lang (neutral),
- en / de (including BCP47 prefixes like de-DE),
- and the same rules on individual <keyword> / <category> elements when present.
Stopwords: English + German closed-class words are stripped in dedupeWords before storage (dedupe is case-insensitive).
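
The language and stopword rules above could look roughly like the sketch below. langAccepted and dedupe are stand-ins for the PR's keywordLangAccepted and dedupeWords, whose real implementations are not shown on this page. The tiny stopword set and the choice to keep the first-seen spelling are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// langAccepted reports whether an xml:lang value is neutral (empty),
// "en"/"de", or a BCP47 tag with one of those primary subtags (e.g. de-DE).
func langAccepted(lang string) bool {
	lang = strings.ToLower(strings.TrimSpace(lang))
	if lang == "" {
		return true // no xml:lang: treated as neutral
	}
	primary, _, _ := strings.Cut(lang, "-")
	return primary == "en" || primary == "de"
}

// stopwords is a tiny illustrative subset, not the PR's actual list.
var stopwords = map[string]struct{}{
	"the": {}, "and": {}, "of": {},
	"der": {}, "die": {}, "das": {}, "und": {},
}

// dedupe drops stopwords and case-insensitive duplicates,
// keeping the first spelling seen for each word.
func dedupe(words []string) []string {
	seen := make(map[string]struct{}, len(words))
	out := make([]string, 0, len(words))
	for _, w := range words {
		w = strings.TrimSpace(w)
		key := strings.ToLower(w)
		if key == "" {
			continue
		}
		if _, stop := stopwords[key]; stop {
			continue
		}
		if _, dup := seen[key]; dup {
			continue
		}
		seen[key] = struct{}{}
		out = append(out, w)
	}
	return out
}

func main() {
	fmt.Println(langAccepted("de-DE"), langAccepted("fr"))
	fmt.Println(dedupe([]string{"Browser", "browser", "und", "Web"}))
}
```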

Database & search

Migrations add keywords and categories on package, extend package_fts with matching columns, and rebuild the FTS index after changes.
update-packages upserts do not overwrite keywords / categories (same pattern as popularity).
Search uses BM25 with higher weight on name/description than on keywords / categories to limit dilution from AppStream text.
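
To make the weighting idea concrete, here is a sketch that renders an FTS5 query using SQLite's bm25() auxiliary function, which takes one weight per FTS column in declaration order. The column list, its order, and the numeric weights are illustrative assumptions; the PR's actual values are not shown on this page.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ftsColumn pairs an FTS5 column with its BM25 weight.
// Column order and weights here are assumptions for illustration only.
type ftsColumn struct {
	Name   string
	Weight float64
}

var columns = []ftsColumn{
	{"name", 10},
	{"description", 5},
	{"keywords", 2},
	{"categories", 1},
}

// searchSQL renders a query that ranks matches with bm25().
// Weights must follow the column order declared on package_fts.
// bm25() returns smaller values for better matches, so ascending order
// puts the best match first.
func searchSQL(cols []ftsColumn) string {
	w := make([]string, len(cols))
	for i, c := range cols {
		w[i] = strconv.FormatFloat(c.Weight, 'f', -1, 64)
	}
	return "SELECT rowid FROM package_fts WHERE package_fts MATCH ? " +
		"ORDER BY bm25(package_fts, " + strings.Join(w, ", ") + ")"
}

func main() {
	fmt.Println(searchSQL(columns))
}
```

Because the weights are positional, a mismatch between the schema's column order and the weight list silently ranks on the wrong columns.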

Operations

go run . update-appstream (requires DATABASE; optional APPSTREAM_SOURCES_BASE).
just update-appstream; just update-data runs update-appstream after update-packages.

Testing

Unit tests for XML parsing (keywords, categories, xml:lang on blocks and elements), keywordLangAccepted, stopwords, dedupeWords.
Tests updated for new FTS column list; just test / just lint pass.

Follow-ups (optional)

If we choose to keep both keywords and categories, we could merge the migrations into a single migration.
Surface categories in the package detail UI if desired.
Revisit BM25 weights after production metrics.

The http.Client parameter on Update was always passed nil from main; the fallback client timeouts (15m / 2m) were pure dead code since the ctx deadline (10m in runCommand) always wins. Match the convention of the other update commands: no client param, no hand-rolled timeouts. Also unexport latestRelease — it's not called from outside the package.

Replace the manual stack + 6 skip flags with the decoder's own Skip(): when a <keywords>/<keyword>/<categories>/<category> tag has a non-en/de xml:lang, skip its entire subtree in one call. The remaining state is five booleans tracking the enter/leave of accepted elements. No behavior change; all existing tests pass.
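
The Skip()-based approach described above can be sketched as follows, restricted to keywords for brevity. This is a simplified illustration under assumed names (langOK, collectKeywords, the sample document), not the PR's parser: it tracks a single boolean instead of five and ignores component boundaries.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// langOK accepts elements without xml:lang, or with an en/de primary subtag.
// For simplicity it matches any attribute whose local name is "lang".
func langOK(attrs []xml.Attr) bool {
	for _, a := range attrs {
		if a.Name.Local != "lang" {
			continue
		}
		primary, _, _ := strings.Cut(strings.ToLower(a.Value), "-")
		return primary == "en" || primary == "de"
	}
	return true
}

// collectKeywords streams tokens and gathers <keyword> text, skipping whole
// subtrees whose xml:lang is rejected via Decoder.Skip.
func collectKeywords(r io.Reader) ([]string, error) {
	dec := xml.NewDecoder(r)
	var out []string
	inKeyword := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local != "keywords" && t.Name.Local != "keyword" {
				continue
			}
			if !langOK(t.Attr) {
				// Consume everything up to the matching end element.
				if err := dec.Skip(); err != nil {
					return nil, err
				}
				continue
			}
			inKeyword = t.Name.Local == "keyword"
		case xml.EndElement:
			if t.Name.Local == "keyword" {
				inKeyword = false
			}
		case xml.CharData:
			if s := strings.TrimSpace(string(t)); inKeyword && s != "" {
				out = append(out, s)
			}
		}
	}
}

// sample is a made-up fragment shaped like AppStream collection XML.
const sample = `<components><component>
<keywords><keyword>browser</keyword><keyword xml:lang="fr">navigateur</keyword></keywords>
<keywords xml:lang="de"><keyword>Webbrowser</keyword></keywords>
<keywords xml:lang="es"><keyword>navegador</keyword></keywords>
</component></components>`

// collectFromString is a convenience wrapper for the demo.
func collectFromString(s string) ([]string, error) {
	return collectKeywords(strings.NewReader(s))
}

func main() {
	kws, err := collectFromString(sample)
	fmt.Println(kws, err)
}
```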

Upstream archlinux-appstream-data publishes core/extra/multilib only, so core-testing/extra-testing/multilib-testing packages never get keywords or categories columns populated. Document the asymmetry so search-ranking surprises don't lead down a wrong debugging path.

Extract the DB-facing portion of Update into applyTerms so tests can drive it directly against in-memory SQLite without mocking the HTTP fetches. Covers:
- Keywords + categories land on matching package rows; non-mentioned packages stay empty.
- FTS matches on both the new keyword and category columns after the rebuild, which catches column-order drift between the schema and the query.
- A second run clears stale data from rows no longer in the accumulator.
- Duplicates and stopwords are stripped by dedupeWords.
- All-stopword input does not update the row at all.

m1rm self-assigned this Apr 12, 2026

m1rm marked this pull request as draft April 12, 2026 13:25

m1rm changed the title ~~feat: initial implementation for appstream data fetcher in Golang~~ feat: support appstream data keywords and categories for FTS Apr 12, 2026

m1rm marked this pull request as ready for review April 12, 2026 14:26

m1rm force-pushed the feat/integrate-appstream-data branch from 01c46f3 to 9157312 Compare April 12, 2026 14:36

m1rm and others added 16 commits April 13, 2026 11:49

feat: initial implementation for appstream data fetcher in Golang

5e210b7

fix errors

601206f

dev: add just recipes for appstream data for convenience and consistency

e73e135

dev: document how the code works

a5e0628

feat: account for common words not needed for FTS

1c2c8e4

refactor: use appstream data keywords only for keywords bc. other dat…

a238710

…a (summary, description etc.) was too messy and thus messed up search rankings

feat: add support for appstream data categories

812b7e9

refactor: remove odd duplication of appstream data base url

71fec7d

chore: simplify & beautify

87f2ac8

beautify

dev: improve ci runs; run once either on push in main or if pull requ…

86582db

…est into main exists and pushes are added to branch with open PR piggyback: improve ci setup; account for duplicate runs tryout to fix duplicate ci runs the other way

appstream: drop redundant slice copy and non-strict XML mode

e93aa06

flush already hands ownership off to the caller (p.cur is nilled), so append-clone of keywords/categories was pure overhead per <component>. Strict=false hid real malformed-input errors; the upstream feed is well-formed XML.

appstream: merge per-name accumulators into one map

10d30ed

Two parallel maps keyed by pkgname plus a third union map to iterate was wasted memory and an extra pass. One map of {kw, cat} slices covers it.

architecture: document update-appstream subcommand

fc3449d

appstream: drop no-op io.Reader conversion

5c4047f

*gzip.Reader already satisfies io.Reader.

pierres force-pushed the feat/integrate-appstream-data branch from 9157312 to 3c5646d Compare April 13, 2026 10:25

pierres added 2 commits April 13, 2026 12:28

m1rm wants to merge 18 commits into main from feat/integrate-appstream-data
